Abstract

We propose local distributional smoothness (LDS), a new notion of smoothness
for statistical models that can be used as a regularization term to promote the
smoothness of the model distribution. We call the LDS-based regularization
virtual adversarial training (VAT). The LDS of a model at an input datapoint is
defined as the KL-divergence-based robustness of the model distribution against
local perturbation around the datapoint. VAT resembles adversarial training, but
differs in that it determines the adversarial direction from the model
distribution alone, without using label information, which makes it applicable to
semi-supervised learning. The computational cost of VAT is relatively low: for
neural networks, the approximated gradient of the LDS can be computed with
no more than three pairs of forward and back propagations. When we applied
our technique to supervised and semi-supervised learning on the MNIST dataset,
it outperformed all the training methods other than the current state-of-the-art
method, which is based on a highly advanced generative model. We also applied
our method to SVHN and NORB, and confirmed its superior performance
over the current state-of-the-art semi-supervised method applied to these
datasets.


1 Introduction 


Overfitting is a serious problem in supervised training of classification and regression functions.
When the number of training samples is finite, the training error computed from the empirical distribution
of the training samples is bound to differ from the test error, which is the expectation of the
log-likelihood with respect to the true underlying probability measure (Akaike, 1998; Watanabe, 2009).


One popular countermeasure against overfitting is the addition of a regularization term to the objective
function. The regularization term makes the optimal parameter for the objective
function less dependent on the likelihood term. In the Bayesian framework, the regularization
term corresponds to the logarithm of the prior distribution of the parameters, which defines the
preference of the model distribution. Of course, there is no universally good model distribution,
and a good model distribution should be chosen depending on the problem we tackle. Still,
experience often suggests that the outputs of good models should be smooth with respect to
their inputs. For example, images and time series occurring in nature tend to be smooth with respect to
space and time (Wahba, 1990). We therefore propose a novel regularization term called
local distributional smoothness (LDS), which rewards the smoothness of the model distribution
with respect to the input around every input datapoint.


We define LDS as the negative of the sensitivity of the model distribution $p(y|x, \theta)$ with respect
to the perturbation of $x$, measured in the sense of KL divergence. The objective function based on
this regularization term is therefore the log-likelihood of the dataset augmented with the sum of LDS
computed at every input datapoint in the dataset. Because LDS measures the local smoothness of
the model distribution itself, the regularization term is parametrization invariant. More precisely, our
regularized objective function $T$ satisfies the natural property that, if $\theta^* = \arg\max_\theta T(\theta)$, then
$f(\theta^*) = \arg\max_\theta T(f(\theta))$ for any diffeomorphism $f$. Therefore, regardless of the parametrization,
the optimal model distribution trained with LDS regularization is unique. This property is
not enjoyed by the popular $L_p$ regularization (Friedman et al., 2001). Also, for sophisticated models
like deep neural networks (Bengio, 2009; LeCun et al., 2015), it is not a simple task to assess the
effect of the $L_p$ regularization term on the topology of the model distribution.


Our work is closely related to adversarial training (Goodfellow et al., 2015). At each step of the training,
Goodfellow et al. identified, for each pair of an observed input $x$ and its label $y$, the direction
of the input perturbation to which the classifier's label assignment of $x$ is most sensitive.
They then penalized the model's sensitivity with respect to the perturbation in this adversarial
direction. Our LDS, on the other hand, is defined without label information. In the language of
adversarial training, LDS at each point therefore measures the robustness of the model against
the perturbation in a local and 'virtual' adversarial direction. We therefore refer to our regularization
method as virtual adversarial training (VAT). Because LDS does not require label information,
VAT is also applicable to semi-supervised learning. This is not the case for adversarial training.


Furthermore, with the second-order Taylor expansion of the LDS and an application of the power
method, we made it possible to approximate the gradient of the LDS efficiently. The approximated
gradient of the LDS can be computed with no more than three pairs of forward and back
propagations.

We summarize the advantages of our method below: 


• Applicability to both supervised and semi-supervised training. 

• At most two hyperparameters. 

• Parametrization invariant formulation. The performance of the method is invariant under
reparametrization of the model.

• Low computational cost. For neural networks in particular, the approximated gradient of
the LDS can be computed with no more than three pairs of forward and back propagations.


When we applied VAT to supervised and semi-supervised learning of the permutation invariant
task for the MNIST dataset, our method outperformed all the contemporary methods other than
the state-of-the-art method (Rasmus et al., 2015), which is based on a highly advanced generative
model. We also applied our method to semi-supervised learning of the permutation invariant task for
the SVHN and NORB datasets, and confirmed our method's superior performance over the current
state-of-the-art semi-supervised method applied to these datasets.


2 Methods 

2.1 Formalization of Local Distributional Smoothness 

We begin with the formal definition of the local distributional smoothness. Let us fix $\theta$ for now,
and suppose we are given the input space $\mathbb{R}^I$, the output space $Q$, and a set of training samples

$$D = \{(x^{(n)}, y^{(n)}) \mid x^{(n)} \in \mathbb{R}^I,\ y^{(n)} \in Q,\ n = 1, \ldots, N\},$$

and consider the problem of using $D$ to train the model distribution $p(y|x, \theta)$ parametrized by $\theta$. Let
$\mathrm{KL}[p\|q]$ denote the KL divergence between the distributions $p$ and $q$. Also, with the hyperparameter
$\epsilon > 0$, we define

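$$\Delta_{\mathrm{KL}}(r, x^{(n)}, \theta) \equiv \mathrm{KL}[p(y|x^{(n)}, \theta) \,\|\, p(y|x^{(n)} + r, \theta)] \tag{1}$$

$$r^{(n)}_{\text{v-adv}} \equiv \arg\max_r \{\Delta_{\mathrm{KL}}(r, x^{(n)}, \theta);\ \|r\|_2 \le \epsilon\}. \tag{2}$$

From now on, we refer to $r^{(n)}_{\text{v-adv}}$ as the virtual adversarial perturbation. We define the local
distributional smoothness (LDS) of the model distribution at $x^{(n)}$ by

$$\mathrm{LDS}(x^{(n)}, \theta) \equiv -\Delta_{\mathrm{KL}}(r^{(n)}_{\text{v-adv}}, x^{(n)}, \theta). \tag{3}$$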

Note that $r^{(n)}_{\text{v-adv}}$ is the direction to which the model distribution $p(y|x^{(n)}, \theta)$ is most sensitive in the
sense of KL divergence. In a way, this is a KL divergence analogue of the gradient $\nabla_x$ of the
model distribution with respect to the input, and perturbation of $x$ in this direction disrupts the local
smoothness of $p(y|x^{(n)}, \theta)$ at $x^{(n)}$ most severely. The smaller the value of $\Delta_{\mathrm{KL}}(r^{(n)}_{\text{v-adv}}, x^{(n)}, \theta)$
at $x^{(n)}$, the smoother $p(y|x^{(n)}, \theta)$ is at $x^{(n)}$. Our goal is to improve the smoothness of the model in
the neighborhood of all the observed inputs. Formulating this goal based on the LDS, we obtain the
following objective function.


$$\frac{1}{N} \sum_{n=1}^{N} \log p(y^{(n)}|x^{(n)}, \theta) + \lambda \frac{1}{N} \sum_{n=1}^{N} \mathrm{LDS}(x^{(n)}, \theta). \tag{4}$$
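As a rough illustration of how definitions (1) to (4) fit together, the sketch below evaluates the objective for a two-dimensional logistic regression model. It is not the procedure used in this paper: the maximization in (2) is done by brute force over the circle $\|r\|_2 = \epsilon$, which is only an assumption that works for this toy model (its KL divergence grows monotonically with $|\theta^T r|$), and the names `delta_kl`, `lds_brute_force` and `vat_objective` as well as the toy data are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def delta_kl(r, x, theta):
    """Eq. (1) for logistic regression: KL[p(y|x) || p(y|x + r)] over y in {0, 1}."""
    p = sigmoid(dot(theta, x))
    q = sigmoid(dot(theta, [xj + rj for xj, rj in zip(x, r)]))
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lds_brute_force(x, theta, eps, n_angles=3600):
    """Eqs. (2)-(3) for 2-D inputs: scan the circle ||r||_2 = eps, negate the largest KL."""
    worst = 0.0
    for k in range(n_angles):
        angle = 2.0 * math.pi * k / n_angles
        r = (eps * math.cos(angle), eps * math.sin(angle))
        worst = max(worst, delta_kl(r, x, theta))
    return -worst


def vat_objective(data, theta, eps, lam):
    """Eq. (4): mean log-likelihood plus lam times mean LDS."""
    loglik = 0.0
    lds = 0.0
    for x, y in data:
        p1 = sigmoid(dot(theta, x))
        loglik += math.log(p1 if y == 1 else 1.0 - p1)
        lds += lds_brute_force(x, theta, eps)
    return (loglik + lam * lds) / len(data)


# Hypothetical toy data: two labeled points in R^2.
data = [((1.0, 0.5), 1), ((-0.5, -1.0), 0)]
theta = (0.8, -0.3)
print(vat_objective(data, theta, eps=0.5, lam=1.0))
```

Because LDS is never positive, the regularized objective with $\lambda > 0$ is never larger than the plain mean log-likelihood, and a larger $\epsilon$ can only make the LDS term more negative.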


We call the training based on (4) virtual adversarial training (VAT). By construction, VAT is
parametrized by the hyperparameters $\lambda > 0$ and $\epsilon > 0$. If we define
$r^{(n)}_{\text{adv}} \equiv \arg\min_r \{p(y^{(n)}|x^{(n)} + r, \theta);\ \|r\|_p \le \epsilon\}$ and replace
$-\Delta_{\mathrm{KL}}(r^{(n)}_{\text{v-adv}}, x^{(n)}, \theta)$ in (3) with $\log p(y^{(n)}|x^{(n)} + r^{(n)}_{\text{adv}}, \theta)$, we obtain
the objective function of adversarial training (Goodfellow et al., 2015). Perturbation of $x^{(n)}$ in
the direction of $r^{(n)}_{\text{adv}}$ can most severely damage the probability that the model correctly assigns the
label $y^{(n)}$ to $x^{(n)}$. As opposed to $r^{(n)}_{\text{adv}}$, the definition of $r^{(n)}_{\text{v-adv}}$ does not require the correct
label $y^{(n)}$. This property allows us to apply VAT to semi-supervised learning.


LDS is a definition meant to be applied to any model whose distribution is smooth with respect
to $x$. For instance, for a linear regression model $p(y|x, \theta) = \mathcal{N}(\theta^T x, \sigma^2)$ the LDS becomes

$$\mathrm{LDS}(x, \theta) = -\frac{1}{2\sigma^2} \epsilon^2 \|\theta\|_2^2,$$

and this has the same form as $L_2$ regularization. This does not, however, mean that $L_2$ regularization
and LDS are equivalent for linear models. It is not difficult to see that, when we reparametrize
the model as $p(y|x, \theta) = \mathcal{N}((\theta^3)^T x, \sigma^2)$, we obtain $\mathrm{LDS}(x, \theta^3) \propto -\epsilon^2 \|\theta^3\|_2^2$, not $-\epsilon^2 \|\theta\|_2^2$. For a
logistic regression model $p(y = 1|x, \theta) = \sigma(\theta^T x) = (1 + \exp(-\theta^T x))^{-1}$, we obtain

$$\mathrm{LDS}(x, \theta) \cong -\frac{1}{2} \sigma(\theta^T x)(1 - \sigma(\theta^T x)) \epsilon^2 \|\theta\|_2^2$$

with the second-order Taylor approximation with respect to $\theta^T r$.
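The two closed forms above can be checked numerically. The sketch below is only a sanity check under stated assumptions: it uses the standard fact that the KL divergence between two Gaussians of equal variance is the squared mean shift divided by $2\sigma^2$, and that for both models the divergence depends on $r$ only through $\theta^T r$, so the maximizer over $\|r\|_2 \le \epsilon$ is $r = \pm\epsilon\,\theta/\|\theta\|_2$. All function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lds_linear_regression(theta, eps, sigma):
    """Exact LDS for p(y|x) = N(theta . x, sigma^2).

    The mean shift theta . r is largest for r = eps * theta / ||theta||,
    which gives KL = eps^2 ||theta||^2 / (2 sigma^2).
    """
    return -(eps ** 2) * dot(theta, theta) / (2.0 * sigma ** 2)


def lds_logistic_exact(x, theta, eps):
    """Exact LDS for logistic regression, trying r = +eps and -eps along theta."""
    a = dot(theta, x)
    p = sigmoid(a)
    shift = eps * math.sqrt(dot(theta, theta))
    worst = 0.0
    for sign in (1.0, -1.0):
        q = sigmoid(a + sign * shift)
        kl = p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
        worst = max(worst, kl)
    return -worst


def lds_logistic_taylor(x, theta, eps):
    """Second-order approximation stated in the text."""
    s = sigmoid(dot(theta, x))
    return -0.5 * s * (1.0 - s) * eps ** 2 * dot(theta, theta)


x = (1.0, -2.0, 0.5)
theta = (0.3, 0.1, -0.4)
print(lds_logistic_exact(x, theta, 0.01), lds_logistic_taylor(x, theta, 0.01))
```

For small $\epsilon$ the exact and approximate logistic values agree closely; the gap grows with $\epsilon$ because the third-order term of the expansion is no longer negligible.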


2.2 Efficient evaluation of LDS and its derivative with respect to $\theta$

Once $r^{(n)}_{\text{v-adv}}$ is computed, the evaluation of the LDS is simply the computation of the KL divergence
between the model distributions $p(y|x^{(n)}, \theta)$ and $p(y|x^{(n)} + r^{(n)}_{\text{v-adv}}, \theta)$. When $p(y|x^{(n)}, \theta)$ can be
approximated with a well-known exponential family, this computation is straightforward. For example,
one can use a Gaussian approximation for many cases of NNs. In what follows, we discuss the
efficient computation of $r^{(n)}_{\text{v-adv}}$, for which there is no evident approximation.

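2.2.1 Evaluation of $r_{\text{v-adv}}$

We assume that $p(y|x, \theta)$ is differentiable with respect to $\theta$ and $x$ almost everywhere. Because
$\Delta_{\mathrm{KL}}(r, x, \theta)$ takes its minimum value at $r = 0$, the differentiability assumption dictates that its first
derivative $\nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=0}$ is zero. Therefore we can take the second-order Taylor approximation
as

$$\Delta_{\mathrm{KL}}(r, x, \theta) \cong \frac{1}{2} r^T H(x, \theta) r, \tag{5}$$

where $H(x, \theta)$ is the Hessian matrix given by $H(x, \theta) \equiv \nabla\nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=0}$. Under this approximation,
$r_{\text{v-adv}}$ emerges as the first dominant eigenvector $u(x, \theta)$ of $H(x, \theta)$, scaled to magnitude $\epsilon$:

$$r_{\text{v-adv}}(x, \theta) \cong \arg\max_r \{r^T H(x, \theta) r;\ \|r\|_2 \le \epsilon\} = \epsilon\, \overline{u(x, \theta)}, \tag{6}$$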

where $\overline{\,\cdot\,}$ denotes an operator acting on an arbitrary non-zero vector $v$ that returns the unit vector in the
direction of $v$, written $\overline{v}$. Hereafter, we denote $H(x, \theta)$ and $u(x, \theta)$ as $H$ and $u$, respectively.

Computing the eigenvectors of the Hessian $H(x, \theta)$ requires $O(I^3)$ computational time, which becomes
infeasibly large for a high-dimensional input space. We therefore resort to the power iteration method (Golub
& Van der Vorst, 2000) and the finite difference method to approximate $r_{\text{v-adv}}$. Let $d$ be a randomly
sampled unit vector. As long as $d$ is not perpendicular to the dominant eigenvector $u$, the iterative
calculation of

$$d \leftarrow \overline{Hd} \tag{7}$$

will make $d$ converge to $u$. We need to do this without the direct computation of $H$, however.
$Hd$ can be approximated by the finite difference¹


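$$Hd \cong \frac{\nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=\xi d} - \nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=0}}{\xi} = \frac{\nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=\xi d}}{\xi} \tag{8}$$

with $\xi \ne 0$. In the computation above, we again used the fact that $\nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=0} = 0$. In summary,
we can approximate $r_{\text{v-adv}}$ with the repeated application of the following update:

$$d \leftarrow \overline{\nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=\xi d}}. \tag{9}$$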

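The approximation improves monotonically with the number of power iterations, $I_p$.
Most notably, the value $\nabla_r \Delta_{\mathrm{KL}}(r, x, \theta)|_{r=\xi d}$ can be computed easily by back propagation
in the case of neural networks. We denote the approximated $r^{(n)}_{\text{v-adv}}$ as
$\tilde{r}^{(n)}_{\text{v-adv}} = \mathrm{GenVAP}(\theta, x^{(n)}, \epsilon, I_p, \xi)$ (see Algorithm 1), and denote the LDS computed with $\tilde{r}^{(n)}_{\text{v-adv}}$ as

$$\widetilde{\mathrm{LDS}}(x^{(n)}, \theta) \equiv -\Delta_{\mathrm{KL}}(\tilde{r}^{(n)}_{\text{v-adv}}, x^{(n)}, \theta). \tag{10}$$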
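The finite-difference approximation of $Hd$ in (8) can be checked on a model for which the Hessian is available in closed form. The sketch below assumes a toy softmax-linear model $p(y|x, \theta) = \mathrm{softmax}(Wx)$, which is not one of the networks used in this paper; for that model $\nabla_r \Delta_{\mathrm{KL}} = W^T(q - p)$ and $H = W^T(\mathrm{diag}(p) - pp^T)W$, where $p$ and $q$ are the predictions at $x$ and $x + r$. The function names and the numbers in the usage lines are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def grad_delta_kl(W, x, r):
    """Gradient of Eq. (1) w.r.t. r for the softmax-linear model: W^T (q - p)."""
    p = softmax(matvec(W, x))
    q = softmax(matvec(W, [xj + rj for xj, rj in zip(x, r)]))
    diff = [qk - pk for qk, pk in zip(q, p)]
    return [sum(W[k][j] * diff[k] for k in range(len(W))) for j in range(len(x))]


def hd_finite_difference(W, x, d, xi=1e-6):
    """Eq. (8): H d is approximated by grad(r = xi * d) / xi, since grad(r = 0) = 0."""
    g = grad_delta_kl(W, x, [xi * dj for dj in d])
    return [gj / xi for gj in g]


def hd_exact(W, x, d):
    """Exact H d with H = W^T (diag(p) - p p^T) W for this toy model."""
    p = softmax(matvec(W, x))
    Wd = matvec(W, d)
    mean = sum(pk * v for pk, v in zip(p, Wd))
    u = [pk * (v - mean) for pk, v in zip(p, Wd)]
    return [sum(W[k][j] * u[k] for k in range(len(W))) for j in range(len(x))]


W = [[0.5, -1.0, 0.2, 0.4], [0.1, 0.3, -0.7, -0.2], [-0.6, 0.8, 0.9, 0.1]]
x = [0.3, -0.8, 0.5, 1.0]
d = [0.5, 0.5, -0.5, 0.5]
print(hd_finite_difference(W, x, d))
print(hd_exact(W, x, d))
```

Only one extra gradient evaluation is needed per product, which is what makes the power iteration affordable when $H$ itself is too large to form.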


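Algorithm 1 Generation of $\tilde{r}^{(n)}_{\text{v-adv}}$

Function GenVAP($\theta$, $x^{(n)}$, $\epsilon$, $I_p$, $\xi$)

1. Initialize $d \in \mathbb{R}^I$ by a random unit vector.

2. Repeat for $i$ in $1, \ldots, I_p$ (perform the power method $I_p$ times):

$d \leftarrow \overline{\nabla_r \Delta_{\mathrm{KL}}(r, x^{(n)}, \theta)|_{r=\xi d}}$

3. Return $\epsilon d$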
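A minimal sketch of Algorithm 1 follows, written for a toy softmax-linear model $p(y|x, \theta) = \mathrm{softmax}(Wx)$ so that the gradient with respect to $r$ can be written analytically; in a neural network this gradient would come from back propagation instead. The model, the default $\xi = 10^{-6}$, the random seed and all function names are assumptions made for illustration and are not taken from the authors' implementation.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, x):
    """Toy model p(y|x, theta) = softmax(W x); W plays the role of theta."""
    return softmax([sum(w * xj for w, xj in zip(row, x)) for row in W])


def kl(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))


def shifted(x, r):
    return [xj + rj for xj, rj in zip(x, r)]


def grad_delta_kl(W, x, r):
    """Gradient of Delta_KL(r, x, theta) w.r.t. r, equal to W^T (q - p) here."""
    p = predict(W, x)
    q = predict(W, shifted(x, r))
    return [sum(W[k][j] * (q[k] - p[k]) for k in range(len(W))) for j in range(len(x))]


def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def gen_vap(W, x, eps, n_power=1, xi=1e-6, rng=None):
    """Algorithm 1: approximate virtual adversarial perturbation."""
    rng = rng or random.Random(0)
    d = unit([rng.gauss(0.0, 1.0) for _ in x])       # step 1: random unit vector
    for _ in range(n_power):                          # step 2: power iterations, Eq. (9)
        d = unit(grad_delta_kl(W, x, [xi * a for a in d]))
    return [eps * a for a in d]                       # step 3: scale to length eps


def approx_lds(W, x, eps, n_power=1):
    """Eq. (10): LDS evaluated at the approximate perturbation."""
    r = gen_vap(W, x, eps, n_power)
    return -kl(predict(W, x), predict(W, shifted(x, r)))
```

With a small $\epsilon$, the perturbation returned by `gen_vap` should change the prediction at least as much, in the KL sense, as a random perturbation of the same length; this is the property the power iteration is meant to deliver.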


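2.2.2 Evaluation of the derivative of the approximated LDS w.r.t. $\theta$

Let $\hat{\theta}$ be the current value of $\theta$ in the algorithm. It remains to evaluate the derivative of $\widetilde{\mathrm{LDS}}(x^{(n)}, \theta)$
with respect to $\theta$ at $\theta = \hat{\theta}$, or

$$\left.\frac{\partial}{\partial\theta} \widetilde{\mathrm{LDS}}(x^{(n)}, \theta)\right|_{\theta=\hat{\theta}} = -\left.\frac{\partial}{\partial\theta} \mathrm{KL}[p(y|x^{(n)}, \theta) \,\|\, p(y|x^{(n)} + \tilde{r}^{(n)}_{\text{v-adv}}, \theta)]\right|_{\theta=\hat{\theta}}. \tag{11}$$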


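By definition, $\tilde{r}_{\text{v-adv}}$ depends on $\theta$. Our numerical experiments, however, indicate that $\nabla_\theta r_{\text{v-adv}}$
is quite volatile with respect to $\theta$, and we could not achieve effective regularization when we used the
numerical evaluation of $\nabla_\theta \tilde{r}_{\text{v-adv}}$ in $\nabla_\theta \widetilde{\mathrm{LDS}}(x^{(n)}, \theta)$. We have therefore followed the work of
Goodfellow et al. (2015) and ignored the derivative of $\tilde{r}_{\text{v-adv}}$ with respect to $\theta$. This modification in
fact achieved better generalization performance and higher $\widetilde{\mathrm{LDS}}(x^{(n)}, \theta)$. We also replaced the first
$\theta$ in the KL term of (11) with $\hat{\theta}$ and computed

$$-\left.\frac{\partial}{\partial\theta} \mathrm{KL}[p(y|x^{(n)}, \hat{\theta}) \,\|\, p(y|x^{(n)} + \tilde{r}^{(n)}_{\text{v-adv}}, \theta)]\right|_{\theta=\hat{\theta}}. \tag{12}$$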


The stochastic gradient descent based on (4) with (12) was able to achieve even better generalization
performance. From now on, we refer to the training of the regularized likelihood (4) based on (12)
as virtual adversarial training (VAT).
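To make the role of the two constants in (12) concrete, the sketch below writes the gradient out for a toy softmax-linear model $p(y|x, \theta) = \mathrm{softmax}(Wx)$, which is an assumption for illustration and not a model used in this paper. Both the target distribution $p(y|x, \hat{\theta})$ and the perturbation $r$ are treated as constants, so the gradient of the KL term with respect to $W$ is $(q - \hat{p})(x + r)^T$, with $q$ the prediction at $x + r$, and (12) is its negative. The names `lds_grad` and `kl_fixed_target` are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, x):
    return softmax([sum(w * xj for w, xj in zip(row, x)) for row in W])


def kl_fixed_target(p_hat, W, x_pert):
    """KL[p_hat || p(.|x_pert, W)] with p_hat held constant."""
    q = predict(W, x_pert)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p_hat, q))


def lds_grad(W_hat, x, r):
    """Eq. (12) for the toy model: -(d/dW) KL[p_hat || p(.|x + r, W)] at W = W_hat.

    p_hat = p(.|x, W_hat) and r are constants; no gradient flows through them.
    """
    p_hat = predict(W_hat, x)
    x_pert = [xj + rj for xj, rj in zip(x, r)]
    q = predict(W_hat, x_pert)
    return [[(p_hat[k] - q[k]) * x_pert[j] for j in range(len(x))]
            for k in range(len(W_hat))]


W = [[0.5, -1.0, 0.2], [0.1, 0.3, -0.7], [-0.4, 0.6, 0.9]]
x = [0.3, -0.8, 0.5]
r = [0.05, 0.02, -0.04]
print(lds_grad(W, x, r))
```

In an automatic differentiation framework the same effect is usually obtained by wrapping the target distribution and the perturbation in a stop-gradient operation before evaluating the KL term.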


¹For many models including neural networks, $Hd$ can be computed exactly (Pearlmutter, 1994). We forgo
this computation in this paper, however, because the implementation of his procedure is not straightforward
in some of the standard deep learning frameworks (e.g. Caffe (Jia et al., 2014), Chainer (Tokui et al., 2015)).

2.3 Computational cost of computing the gradient of LDS

We would also like to comment on the computational cost required for (12). We will restrict our
discussion to the case with $I_p = 1$ in (9), because one iteration of the power method was sufficient for
computing an accurate $Hd$, and increasing $I_p$ did not have much effect in any of our experiments.

With the efficient computation of LDS we presented above, we only need to compute
$\nabla_r \Delta_{\mathrm{KL}}(r, x^{(n)}, \theta)|_{r=\xi d}$ in order to compute $\tilde{r}^{(n)}_{\text{v-adv}}$. Once $\tilde{r}^{(n)}_{\text{v-adv}}$ is decided, we compute the gradient of the LDS
with respect to the parameter with (12). For neural networks, the steps that we listed above require
only two forward propagations and two back propagations. In semi-supervised training, we would
also need to evaluate the probability distribution $p(y|x^{(n)}, \theta)$ in (1) for unlabeled samples. As long
as we use the same dataset to compute the likelihood term and the LDS term, this requires only one
additional forward propagation. Overall, especially for neural networks, we need no more than three
pairs of forward propagation and back propagation to compute the derivative of the approximated LDS
with (12).


3 Experiments 

All the computations were conducted with Theano (Bergstra et al., 2010; Bastien et al., 2012).
Reproducing code is uploaded on https://github.com/takerum/vat. Throughout the
experiments on our proposed method, we used a fixed value of $\lambda = 1$, and we also used a fixed value
of $I_p = 1$ except for the experiments on synthetic datasets.


3.1 Supervised learning for the binary classification of synthetic dataset

Figure 1: Visualization of the synthetic datasets: (a) ‘Moons’ dataset, (b) ‘Circles’ dataset. Each panel
depicts the total of 16 training data points. Red circles stand for the samples with label 1 and blue
triangles stand for the samples with label 0. Samples with each label are prepared uniformly from the
light-colored trajectory.


We created two synthetic datasets by generating multiple points uniformly over two trajectories on
$\mathbb{R}^2$, as shown in Figure 1, and linearly embedding them into a 100-dimensional vector space. The
datasets are called (a) ‘Moons’ dataset and (b) ‘Circles’ dataset based on the shapes of the two
trajectories, respectively.

Each dataset consists of 16 training samples and 1000 test samples. Because the number of
samples is very small relative to the input dimension, maximum likelihood estimation (MLE) is
vulnerable to overfitting on these datasets. The set of hyperparameters in each regularization
method and the other detailed experimental settings are described in Appendix A.1. We repeated
the experiments 50 times with different samples of training and test sets, and reported the average
of the 50 test performances.
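One way such a dataset could be generated is sketched below. The text only states that points are drawn uniformly from two trajectories in $\mathbb{R}^2$ and embedded linearly into 100 dimensions, so the exact arc shapes, the Gaussian embedding matrix, the equal class balance and every name in the sketch (`sample_moons`, `make_dataset`) are assumptions for illustration rather than the settings actually used.

```python
import math
import random


def sample_moons(n_per_class, rng):
    """Sample points uniformly along two interleaved half-circles in R^2."""
    points, labels = [], []
    for _ in range(n_per_class):
        t = rng.uniform(0.0, math.pi)
        points.append((math.cos(t), math.sin(t)))              # upper arc, label 1
        labels.append(1)
        t = rng.uniform(0.0, math.pi)
        points.append((1.0 - math.cos(t), 0.5 - math.sin(t)))  # lower arc, label 0
        labels.append(0)
    return points, labels


def random_embedding(dim_out, dim_in, rng):
    """Random linear map R^dim_in -> R^dim_out (assumed Gaussian entries)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim_in)] for _ in range(dim_out)]


def embed(points, A):
    return [[sum(a * p for a, p in zip(row, pt)) for row in A] for pt in points]


def make_dataset(n_train=16, n_test=1000, dim=100, seed=0):
    rng = random.Random(seed)
    A = random_embedding(dim, 2, rng)  # the same map is applied to train and test points
    x_train, y_train = sample_moons(n_train // 2, rng)
    x_test, y_test = sample_moons(n_test // 2, rng)
    return (embed(x_train, A), y_train), (embed(x_test, A), y_test)


(train_x, train_y), (test_x, test_y) = make_dataset()
print(len(train_x), len(train_x[0]), len(test_x))
```

Repeating the experiment with different samples, as described above, would correspond to calling `make_dataset` with a different `seed` each time.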

Our classifier was a neural network (NN) with one hidden layer consisting of 100 hidden units. We
used the ReLU (Jarrett et al., 2009; Nair & Hinton, 2010; Glorot et al., 2011) activation function for
hidden units, and the softmax activation function for all the output units. The regularization methods
we compared against VAT on this dataset include $L_2$ regularization ($L_2$-reg), dropout (Srivastava
et al., 2014), adversarial training (Adv), and random perturbation training (RP). Random
perturbation training is a modified version of VAT in which we replaced $r_{\text{v-adv}}$ with a vector of
length $\epsilon$ whose direction is uniformly sampled from the $I$-dimensional unit sphere. We compared random
perturbation training with VAT in order to highlight the importance of choosing the appropriate direction
of the perturbation. As for adversarial training, we followed Goodfellow et al. (2015) and determined
the size of the perturbation $r$ in terms of both the $L_\infty$ norm and the $L_2$ norm.
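The random perturbation used as a baseline can be drawn with a standard construction: normalizing a vector of independent Gaussian samples gives a direction that is uniform on the unit sphere. The sketch below assumes this construction; the paper does not say how the sampling was implemented, and the function name is hypothetical.

```python
import math
import random


def random_perturbation(dim, eps, rng):
    """Vector of length eps whose direction is uniform on the unit sphere in R^dim."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(v * v for v in g))
    return [eps * v / n for v in g]


r = random_perturbation(100, 0.5, random.Random(0))
print(math.sqrt(sum(v * v for v in r)))
```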


Figure 2: Comparison of transitions of the average LDS (I) and error rate (II) between MLE and VAT
implemented with $\epsilon = 0.5$ and $I_p = 1$, for (a) ‘Moons’ and (b) ‘Circles’. The average LDS shown in
(a.I) and (b.I) was evaluated on the training and test samples with $\epsilon = 0.5$ and $I_p = 5$.


Figure 2 compares the learning process between VAT and MLE. Panels (a.I) and (b.I) show
the transitions of the average LDS, while panels (a.II) and (b.II) show the transitions of the error rate,
on both training and test sets. The average LDS is nearly zero at the beginning, because the models
are initially close to the uniform distribution around each input. The average LDS then decreases
slowly for VAT, and falls rapidly for MLE. Although the training error eventually drops to
zero for both methods, the final test error of VAT is significantly lower than that of MLE.
This difference suggests that a sustained high value of LDS is beneficial in alleviating overfitting
and in decreasing the test error.

Figure 3: Contours of $p(y = 1|x, \theta)$ drawn by NNs (with ReLU activation) trained with various
regularization methods for a single dataset of ‘Moons’. A black line represents the contour of value
0.5. Red circles represent the data points with label 1, and blue triangles represent the data points
with label 0. The value above each panel is the average LDS evaluated on the training set with
$\epsilon = 0.5$ and $I_p = 5$. The color bar is marked from 0.05 to 0.95 in steps of 0.15.

Figure 4: Contours of $p(y = 1|x, \theta)$ drawn by NNs trained with various regularization methods for
a single dataset of ‘Circles’. The rest of the details follow the caption of Figure 3.

Figures 3 and 4 show the contour plots of the model distributions for the binary classification problems
of ‘Moons’ and ‘Circles’ trained with the best set of hyperparameters. The value above each
panel corresponds to the average LDS value. We see from the figures that the NN without regularization
(MLE) and the NN with $L_2$ regularization draw decisively wrong decision boundaries. The
decision boundary drawn by dropout for ‘Circles’ is convincing, but dropout's decision boundary
for ‘Moons’ does not coincide with our intention. The opposite can be said for random perturbation
training. Only adversarial training and VAT consistently yield the intended decision
boundaries for both datasets. VAT draws an appropriate decision boundary by imposing local
smoothness regularization around each data point. This does not mean, however, that a large
value of LDS immediately implies a good decision boundary. By its very definition, LDS tends to
disfavor abrupt changes of the likelihood around training datapoints. A large value of LDS therefore
forces a large relative margin around the decision boundary. One can achieve a large value of LDS with
$L_2$ regularization, dropout and random perturbation training with an appropriate choice of
hyperparameters by smoothing the model distribution globally. This, however, comes at the cost of accuracy.
Figure 5 summarizes the average test errors of the six regularization methods with the best set
of hyperparameters. Adversarial training and VAT achieved much lower test errors than the other
regularization methods. Surprisingly, the performance of VAT did not change much with the
value of $I_p$. We see that $I_p = 1$ suffices for our dataset.

Figure 5: Comparison of the test error rate for (a) ‘Moons’ and (b) ‘Circles’.


3.2 Supervised learning for the classification of the MNIST dataset 


Next, we tested the performance of our regularization method on the MNIST dataset, which consists
of 28 x 28 pixel images of handwritten digits and their corresponding labels. The input dimension
is therefore 28 x 28 = 784 and each label is one of the numerals from 0 to 9. We split the
original 60,000 training samples into 50,000 training samples and 10,000 validation samples, and
used the latter to tune the hyperparameters. We applied our methods to the training of
two types of NNs with different numbers of layers, 2 and 4. As for the number of hidden units, we
used (1200, 600) and (1200, 600, 300, 150), respectively. The ReLU activation function and the batch
normalization technique (Ioffe & Szegedy, 2015) were used for all the NNs. The detailed settings
of the experiments are described in Appendix A.2.


For each regularization method, we used the set of hyperparameters that achieved the best performance
on the validation data to train the NN on all training samples. We applied the trained networks
to the test set and recorded their test errors. We repeated this procedure 10 times with different seeds
for the weight initialization, and reported the average test error values.


Table 1 summarizes the test errors obtained by our regularization method (VAT) and the other
regularization methods. VAT performed better than all the contemporary methods except Ladder network,
which is a method based on a highly advanced generative model.


3.3 Semi-supervised learning for the classification of the benchmark datasets 

Recall that our definition of LDS (Eq. (3)) at any point $x$ is independent of the label information
$y$. This in particular means that we can apply VAT to semi-supervised learning tasks. We
would like to emphasize that this is a property not enjoyed by adversarial training and dropout.
We applied VAT to semi-supervised learning tasks in the permutation invariant setting for three
datasets: MNIST, SVHN, and NORB. In this section we describe the details of our setup for the
semi-supervised learning of the MNIST dataset, and leave the further details of the experiments to
Appendix A.3 and A.4.

We experimented with four sizes of labeled training samples $N_l = \{100, 600, 1000, 3000\}$ and
observed the effect of $N_l$ on the test error. We used a validation set of fixed size 1000, and used
all the training samples excluding the validation set and the labeled samples as the unlabeled set to
train the NNs. That is, when $N_l = 100$, the unlabeled training set had the size of
60,000 - 100 - 1,000 = 58,900. For each choice of the hyperparameters, we repeated the experiment
10 times with a different set each time. As for the architecture of NNs, we used ReLU-based NNs with
two hidden layers with the number of hidden units (1200, 1200). Batch normalization was implemented
as well.

Table 1: Test errors of 10-class supervised learning for the permutation invariant MNIST task. Stars
* indicate the methods that are dependent on generative models or pre-training. Test errors in the
upper panel are the ones reported in the literature, and test errors in the bottom panel are the outputs
of our implementation.

Method                                             Test error (%)
-----------------------------------------------------------------
SVM (Gaussian kernel)                              1.40
Gaussian dropout (Srivastava et al., 2014)         0.95
Maxout Networks (Goodfellow et al., 2013)          0.94
*MTC (Rifai et al., 2011)                          0.81
*DBM (Srivastava et al., 2014)                     0.79
Adversarial training (Goodfellow et al., 2015)     0.782
*Ladder network (Rasmus et al., 2015)              0.57±0.02
-----------------------------------------------------------------
Plain NN (MLE)                                     1.11
Random perturbation training                       0.843
Adversarial training (with L∞ norm constraint)     0.788
Adversarial training (with L2 norm constraint)     0.708
VAT (ours)                                         0.637±0.046


Table 2a summarizes the results for the permutation invariant MNIST task. All the methods other
than SVM and plain NN (MLE) are semi-supervised learning methods. For the MNIST dataset,
VAT outperformed all the contemporary methods other than Ladder network (Rasmus et al., 2015),
which is the current state of the art.


Table 2b and Table 2c summarize the results for SVHN and NORB, respectively. Our method
strongly outperforms the current state-of-the-art semi-supervised learning method applied to these
datasets.


4 Discussion and Related Works 


Our VAT was motivated by adversarial training (Goodfellow et al., 2015). Adversarial training
and VAT are similar in that they both use the local input-output relationship to smooth the model
distribution in the corresponding neighborhood. In contrast, $L_2$ regularization does not use the
local input-output relationship, and cannot introduce local smoothness to the model distribution.
Increasing the regularization constant in $L_2$ regularization can only intensify the global smoothing
of the distribution, which results in higher training and generalization error. Adversarial training
aims to train the model while keeping the average of the following value high:

$$\min_r \{\log p(y^{(n)}|x^{(n)} + r, \theta);\ \|r\|_p \le \epsilon\}. \tag{13}$$

This makes the likelihood evaluated at the $n$-th labeled data point robust against an $\epsilon$-perturbation
applied to the input in its adversarial direction.
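For comparison with the virtual adversarial perturbation, the sketch below computes a label-dependent adversarial perturbation for a toy softmax-linear model $p(y|x, \theta) = \mathrm{softmax}(Wx)$. It uses the usual linear approximation of (13): with $g = \nabla_x \log p(y|x, \theta)$, the minimizer is approximately $-\epsilon\, g/\|g\|_2$ under the $L_2$ constraint and $-\epsilon\, \mathrm{sign}(g)$ under the $L_\infty$ constraint. The model, the numbers and the function names are assumptions for illustration.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict(W, x):
    return softmax([sum(w * xj for w, xj in zip(row, x)) for row in W])


def grad_log_lik(W, x, y):
    """Gradient of log p(y|x) w.r.t. x for the toy model: W^T (onehot(y) - p)."""
    p = predict(W, x)
    return [sum(W[k][j] * ((1.0 if k == y else 0.0) - p[k]) for k in range(len(W)))
            for j in range(len(x))]


def adversarial_perturbation(W, x, y, eps, norm="l2"):
    """Linearized solution of Eq. (13); note that the label y is required."""
    g = grad_log_lik(W, x, y)
    if norm == "linf":
        return [-eps * math.copysign(1.0, gj) for gj in g]
    n = math.sqrt(sum(gj * gj for gj in g))
    return [-eps * gj / n for gj in g]


W = [[0.5, -1.0, 0.2], [0.1, 0.3, -0.7], [-0.4, 0.6, 0.9]]
x = [0.3, -0.8, 0.5]
y = 0
print(adversarial_perturbation(W, x, y, 0.1, norm="l2"))
print(adversarial_perturbation(W, x, y, 0.1, norm="linf"))
```

Unlike the virtual adversarial perturbation, this direction cannot be computed for an unlabeled input, because the gradient depends on $y$.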

PEA (Bachman et al., 2014), on the other hand, used the model's sensitivity to random perturbation
on input and hidden layers in their construction of the regularization function. PEA is similar to
the random perturbation training of our experiment in that it aims to make the model distribution
robust against random perturbation. Our VAT, however, outperforms PEA and random perturbation
training. This fact is particularly indicative of the importance of the role of the Hessian $H$ in VAT
(Eq. (5)). Because PEA and random perturbation training attempt to smooth the distribution in an
arbitrary direction at every step of the update, they tend to make the variance of the loss function large
in unnecessary dimensions. On the other hand, VAT always projects the perturbation in the principal
direction of $H$, which is literally the principal direction in which the distribution is sensitive.

Deep contractive network by Gu and Rigazio (Gu & Rigazio, 2015) takes still another approach to
smooth the model distribution. Gu and Rigazio introduced a penalty term based on the Frobenius norm
of the Jacobian of the neural network's output $y = f(x, \theta)$ with respect to $x$. Instead of computing
the computationally expensive full Jacobian, they approximated the Jacobian by the sum of the
Frobenius norms of the Jacobians over every adjacent pair of hidden layers. The deep contractive
network was, however, unable to significantly decrease the test error.

Table 2: Test errors of semi-supervised learning for the permutation invariant task for MNIST,
SVHN and NORB. All values are the averages over random partitioning of the dataset. Stars *
indicate the methods that are dependent on generative models or pre-training.

(a) MNIST

                                           Test error (%)
Method                                     N_l = 100   600     1000    3000
---------------------------------------------------------------------------
SVM (Weston et al., 2012)                  23.44       8.85    7.77    4.21
TSVM (Weston et al., 2012)                 16.81       6.16    5.38    3.45
EmbedNN (Weston et al., 2012)              16.9        5.97    5.73    3.59
*MTC (Rifai et al., 2011)                  12.0        5.13    3.64    2.57
PEA (Bachman et al., 2014)                 10.79       2.44    2.23    1.91
*PEA (Bachman et al., 2014)                5.21        2.87    2.64    2.30
*DG (Kingma et al., 2014)                  3.33        2.59    2.40    2.18
*Ladder network (Rasmus et al., 2015)      1.06        -       0.84    -
---------------------------------------------------------------------------
Plain NN (MLE)                             21.98       9.16    7.25    4.32
VAT (ours)                                 2.33        1.39    1.36    1.25

(b) SVHN

                                           Test error (%)
Method                                     N_l = 1000
-----------------------------------------------------
TSVM (Kingma et al., 2014)                 66.55
*DG, M1+TSVM (Kingma et al., 2014)         55.33
*DG, M1+M2 (Kingma et al., 2014)           36.02
-----------------------------------------------------
SVM (Gaussian kernel)                      63.28
Plain NN (MLE)                             43.21
VAT (ours)                                 24.63

(c) NORB

                                           Test error (%)
Method                                     N_l = 1000
-----------------------------------------------------
TSVM (Kingma et al., 2014)                 26.00
*DG, M1+TSVM (Kingma et al., 2014)         18.79
-----------------------------------------------------
SVM (Gaussian kernel)                      23.62
Plain NN (MLE)                             20.00
VAT (ours)                                 9.88


Ladder network (Rasmus et al., 2015) is a method that uses a layer-wise denoising autoencoder. Their
method is currently the best method in both supervised and semi-supervised learning for the permutation
invariant MNIST task. Ladder network seems to be conducting a variation of manifold learning that
extracts knowledge of the local distribution of the inputs. Still another classic way to include
information about the input distribution is to use local smoothing of the data, as in Vicinal Risk
Minimization (VRM) (Chapelle et al., 2001). VAT, on the other hand, only uses the property of the
conditional distribution $p(y|x, \theta)$, giving no consideration to the generative process $p(x|\theta)$ nor
to the full joint distribution $p(y, x|\theta)$. In this respect, VAT can be complementary to the methods that
explicitly model the input distribution. We might be able to improve VAT further by introducing the
notion of manifold learning into its framework.


5 Conclusion 

Our experiments based on the synthetic datasets and the three real-world datasets, MNIST, SVHN
and NORB, indicate that VAT is an effective method for both supervised and semi-supervised
learning. For the MNIST dataset, VAT outperformed all contemporary methods other than Ladder
network, which is the current state of the art. VAT also outperformed the current state-of-the-art
semi-supervised learning method for SVHN and NORB. We would also like to emphasize the
simplicity of the method. With our approximation of LDS, VAT can be computed with relatively
small computational cost. Also, models that rely heavily on generative models are dependent on
many hyperparameters. VAT, on the other hand, has only two hyperparameters, $\epsilon$ and $\lambda$. In fact, our
experiments worked sufficiently well with the optimization of the single hyperparameter $\epsilon$ while fixing
$\lambda = 1$.
A Appendix: Details of experimental settings

A.1 Supervised binary classification for synthetic datasets

We provide more details of experimental settings on synthetic datasets. We list the search space for 
the hyperparameters below: 

• $L_2$ regularization: regularization coefficient $\lambda = \{10^{-4}, \ldots, 200\}$

• Dropout (only for the input layer): dropout rate $p(z) = \{0.05, \ldots, 0.95\}$

• Random perturbation training: $\epsilon = \{0.2, \ldots, 4.0\}$

• Adversarial training (with $L_\infty$ norm constraint): $\epsilon = \{0.01, \ldots, 0.2\}$

• Adversarial training (with $L_2$ norm constraint): $\epsilon = \{0.1, \ldots, 2.0\}$

• VAT: $\epsilon = \{0.1, \ldots, 2.0\}$

All experiments with random perturbation training, adversarial training and VAT were conducted 
with λ = 1. For training, we used stochastic gradient descent (SGD) with the momentum method. 
When J(θ) is the objective function, the momentum method augments the simple SGD update 
with a term that depends on the previous update Δθ_{i-1}: 

\Delta\theta_i = \mu_i \Delta\theta_{i-1} + (1 - \mu_i)\,\gamma_i \frac{\partial}{\partial\theta} J(\theta). \qquad (14) 

In the expression above, μ_i ∈ [0, 1) is the strength of the momentum and γ_i is the 
learning rate. In our experiments, we used μ_i = 0.9 and an exponentially decreasing γ_i with 
rate 0.995, starting from γ_1 = 1.0. We trained the NNs with 1,000 parameter updates. 
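
The momentum update described above can be illustrated on a toy problem. The following is only a minimal sketch: it assumes the update has the form Δθ_i = μ_i Δθ_{i-1} + (1 - μ_i) γ_i ∂J/∂θ, that J is maximized (so the update is added to θ), and that the learning rate is multiplied by 0.995 after every update. The function names and the one-dimensional objective are hypothetical and are not part of the original experiments.

```python
def momentum_ascent(grad, theta0, mu=0.9, gamma1=1.0, decay=0.995, n_updates=1000):
    """Maximize an objective with a momentum update (assumed form, see lead-in)."""
    theta = theta0
    delta = 0.0      # previous update, Delta theta_{i-1}
    gamma = gamma1   # current learning rate gamma_i
    for _ in range(n_updates):
        # Blend the previous update with the current gradient step.
        delta = mu * delta + (1.0 - mu) * gamma * grad(theta)
        theta = theta + delta   # ascent step: J is treated as maximized (assumption)
        gamma *= decay          # exponentially decreasing learning rate
    return theta


# Hypothetical toy objective J(theta) = -(theta - 3)^2, maximized at theta = 3.
def toy_grad(theta):
    return -2.0 * (theta - 3.0)


theta_hat = momentum_ascent(toy_grad, theta0=0.0)
```

With these settings the learning rate has shrunk to less than one percent of its initial value by the end of the 1,000 updates, so the late updates only fine-tune the parameters.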

A.2 Supervised classification for the MNIST dataset 

We provide more details of experimental settings on supervised classification for the MNIST dataset. 
The following list summarizes the ranges over which we searched for the best hyperparameters of each 
regularization method: 

• Random perturbation training: ε = {5.0, ..., 15.0} 

• Adversarial training (with L∞ norm constraint): ε = {0.05, ..., 0.1} 

• Adversarial training (with L2 norm constraint): ε = {1.0, ..., 3.0} 

• VAT: ε = {1.0, ..., 3.0}, I_p = 1 

All experiments were conducted with λ = 1. Training was performed by mini-batch SGD 
based on ADAM (Kingma & Ba, 2015). We chose a mini-batch size of 100 and used the default 
values of Kingma & Ba (2015) for the tunable parameters of ADAM. We trained the NNs with 
50,000 parameter updates. For the base learning rate during validation, we used an initial value 
of 0.002 and an exponential decay schedule with rate 0.9 per 500 updates. After determining the 
hyperparameters, we trained the NNs over 60,000 parameter updates, with an initial learning 
rate of 0.002 and an exponential decay schedule with rate 0.9 per 600 updates. 
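
The learning rate schedule described above can be written as a short function. This is a sketch under an assumption: the text does not say whether the decay is applied in discrete steps (once every 500 or 600 updates) or continuously, so the hypothetical `staircase` flag exposes both readings.

```python
def exponential_decay_lr(step, initial_lr=0.002, rate=0.9, period=500, staircase=True):
    """Learning rate after `step` parameter updates.

    staircase=True  : multiply by `rate` once every `period` updates (assumption).
    staircase=False : decay smoothly so that the factor equals `rate` after `period` updates.
    """
    exponent = step // period if staircase else step / period
    return initial_lr * rate ** exponent


# Validation runs: decay every 500 updates; final runs: decay every 600 updates.
lr_validation = [exponential_decay_lr(s, period=500) for s in (0, 500, 1000)]
lr_final = [exponential_decay_lr(s, period=600) for s in (0, 600, 1200)]
```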


A.3 Semi-supervised classification for the MNIST dataset 

We provide more details of the experimental settings for semi-supervised classification on the MNIST 
dataset. We searched for the best hyperparameter ε over {0.2, 0.3, 0.4} in the N_l = 100 case and 
over {1.5, 2.0, 2.5} in all other cases. All experiments were conducted with λ = 1 
and I_p = 1. For optimization, we again used ADAM-based minibatch SGD with the 
same hyperparameter values as in the supervised setting. Note that the likelihood term can be 
computed from labeled data only. We therefore used two separate 
minibatches at each step: one minibatch of size 100 drawn from the labeled samples for computing 
the likelihood term, and another minibatch of size 250 drawn from both labeled and unlabeled samples 
for computing the regularization term. We trained the NNs over 50,000 parameter updates. For the 
learning rate, we used an initial value of 0.002 and an exponential decay schedule with 
rate 0.9 per 500 updates. 
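
The two-minibatch scheme described above can be sketched as follows. The index ranges, the function name, and the choice of sampling without replacement within a step are assumptions made for illustration; the text only fixes the two batch sizes (100 and 250) and the pools they are drawn from.

```python
import random


def sample_minibatches(labeled_idx, unlabeled_idx, rng, n_labeled=100, n_mixed=250):
    """Draw the two index sets used in one training step."""
    # Batch for the likelihood term: labeled samples only.
    labeled_batch = rng.sample(labeled_idx, n_labeled)
    # Batch for the regularization term: labeled and unlabeled samples pooled.
    mixed_batch = rng.sample(labeled_idx + unlabeled_idx, n_mixed)
    return labeled_batch, mixed_batch


rng = random.Random(0)
labeled_idx = list(range(1000))             # hypothetical: 1,000 labeled samples
unlabeled_idx = list(range(1000, 60000))    # hypothetical: remaining samples
lab_batch, mix_batch = sample_minibatches(labeled_idx, unlabeled_idx, rng)
```

The regularization term needs no labels, so the second batch can draw on the full pool while the first is restricted to the labeled subset.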


A.4 Semi-supervised classification for the SVHN and NORB datasets 


We provide the details of the numerical experiments we conducted on the SVHN and NORB datasets. 
The SVHN dataset consists of 32 × 32 × 3 pixel RGB images of house numbers and their 
corresponding labels (0-9); the dataset contains 73,257 training samples and 26,032 test 
samples. To simplify the experiment, we downsampled the images from 
32 × 32 × 3 to 16 × 16 × 3. We vectorized each image into a 768-dimensional vector and applied 
whitening (Coates et al., 2011) to the dataset. We reserved 1,000 samples for validation. From the 
rest, we used 1,000 samples as the labeled dataset in semi-supervised training. We repeated the 
experiment 10 times with different choices of labeled and validation datasets. 
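
The SVHN preprocessing steps above (downsampling, vectorization, whitening) might look roughly like the sketch below. Two points are assumptions: the text does not say how the images were downsampled, so 2 × 2 average pooling is used here, and the whitening is implemented as ZCA whitening with a hypothetical regularizer `eps`. Random arrays stand in for the real images.

```python
import numpy as np


def downsample_2x(images):
    """Average non-overlapping 2x2 blocks: (n, 32, 32, 3) -> (n, 16, 16, 3)."""
    n, h, w, c = images.shape
    return images.reshape(n, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))


def zca_whiten(x, eps=1e-2):
    """ZCA-whiten row vectors x of shape (n, d); eps is a hypothetical regularizer."""
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / xc.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)
    transform = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return xc @ transform


rng = np.random.default_rng(0)
fake_images = rng.random((200, 32, 32, 3))    # stand-in for SVHN images
small = downsample_2x(fake_images)
vectors = small.reshape(len(small), -1)       # 16 * 16 * 3 = 768 dimensions
whitened = zca_whiten(vectors)
```

The same pipeline would apply to NORB with 3 × 3 pooling of the 2 × 96 × 96 images, which gives the 2048-dimensional vectors mentioned in the text.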

The NORB dataset consists of 2 × 96 × 96 pixel gray images of 50 different objects and their 
corresponding labels (cars, trucks, planes, animals, humans). The training set and the test 
set each contain 24,300 samples. We downsampled the images from 2 × 96 × 96 to 
2 × 32 × 32. We vectorized each image into a 2048-dimensional vector and applied whitening. We 
reserved 1,000 samples for validation. From the rest, we used 1,000 samples as the labeled dataset in 
semi-supervised training. We repeated the experiment 10 times with different choices of labeled 
and validation datasets. 


For both SVHN and NORB, we used neural networks with the number of hidden nodes given by 
(1200, 600, 300, 150, 150). In our setup, these two datasets favored a deeper network than the 
one we used for the MNIST dataset. We used ReLU as the activation function. For the 
hyperparameter ε, we conducted a grid search over {1.0, 1.5, ..., 4.5, 5.0}, and we used λ = 1. 
For power iteration, we used I_p = 1. 
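
To make the network shape and search grid concrete, here is a minimal NumPy sketch of a forward pass through a ReLU network with the hidden sizes listed above. The Gaussian initialization, its scale, and the softmax output layer are assumptions, as are the function names; the input dimension 768 and the 10 classes correspond to the SVHN setup.

```python
import numpy as np

HIDDEN_SIZES = (1200, 600, 300, 150, 150)
EPSILON_GRID = [1.0 + 0.5 * k for k in range(9)]   # 1.0, 1.5, ..., 5.0


def init_params(input_dim, n_classes, rng, scale=0.01):
    """Gaussian weights and zero biases (initialization scheme is an assumption)."""
    sizes = (input_dim, *HIDDEN_SIZES, n_classes)
    return [(scale * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """ReLU hidden layers followed by a softmax output (output layer assumed)."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    logits = h @ w + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
params = init_params(input_dim=768, n_classes=10, rng=rng)
probs = forward(params, rng.standard_normal((4, 768)))
```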

We used ADAM-based minibatch SGD with the same hyperparameter values as for MNIST. We 
chose a minibatch size of 100, so one epoch for SVHN consists of ⌈73,257/100⌉ = 733 
minibatch updates. For the learning rate, we used an initial value of 0.002 with exponential 
decay of rate 0.9 per epoch. We trained the NNs for 100 epochs. 
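
The bookkeeping behind the epoch-based schedule can be checked with a few lines of code. This sketch counts the minibatch updates needed to cover the full SVHN training set once and maps an update index to a learning rate, assuming the decay is applied once per completed epoch. The function names are hypothetical.

```python
import math


def updates_per_epoch(n_train, batch_size):
    """Number of minibatches needed to cover the training set once."""
    return math.ceil(n_train / batch_size)


def lr_at_update(step, n_train, batch_size, initial_lr=0.002, rate=0.9):
    """Learning rate when the decay is applied once per completed epoch (assumption)."""
    epoch = step // updates_per_epoch(n_train, batch_size)
    return initial_lr * rate ** epoch


svhn_updates = updates_per_epoch(73257, 100)   # updates in one SVHN epoch
total_updates = 100 * svhn_updates             # updates over 100 epochs
```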